Bank Customer Churn Analysis Using Support Vector Machine

Gopi Shankar Mallu
Vamsi Kalla
Dinesh Donkada
Kavya Maale

2024-04-18

Introduction

Support Vector Machines (SVMs) are powerful supervised learning models used for both classification and regression tasks. By constructing a hyperplane in a high-dimensional space, SVMs achieve class separation by maximizing the margin between the closest data points of each class. This approach not only enhances the model’s accuracy but also its predictive reliability across datasets with numerous features or clear separations between classes.

Hyperplanes

- Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features.

Image A shows a clear linear separation, demonstrating the SVM hyperplane in a scenario where the data is linearly separable.

Image B illustrates data that is not linearly separable in its original space but can be tackled by SVM through the use of a kernel function.

Methodology

Support Vector Machines (SVMs) serve as a robust methodology for binary classification by creating a hyperplane that acts as a decision boundary between two classes. This hyperplane is determined mathematically by the equation \(w^T x + b = 0\), where \(w\) is the weight vector perpendicular to the hyperplane, and \(b\) is the bias, shifting the hyperplane away from the origin.
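As a concrete illustration of this decision rule, the sketch below classifies a point by the sign of \(w^T x + b\). The weights and points are made up for illustration, not fitted from the churn data.

```python
# Hypothetical 2-D example: classify a point by the sign of w.x + b.
# The weight vector and bias below are illustrative, not learned values.

def svm_decision(w, x, b):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [2.0, -1.0]   # weight vector perpendicular to the hyperplane
b = -0.5          # bias shifting the hyperplane away from the origin

print(svm_decision(w, [1.0, 0.5], b))   # 2*1 - 0.5 - 0.5 = 1.0 -> 1
print(svm_decision(w, [0.0, 1.0], b))   # -1 - 0.5 = -1.5 -> -1
```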

Kernel Function in SVM

In Support Vector Machines (SVM), the kernel function plays a crucial role in transforming the input feature space into a higher-dimensional space where the data can be linearly separated. This is particularly useful in cases where the data is not linearly separable in its original space. The kernel function computes the dot product between the feature vectors in this higher-dimensional space without explicitly mapping the vectors into that space, which is known as the “kernel trick.”

Common types of kernel functions include:

Linear Kernel: \(K(x_i, x_j) = x_i^T x_j\). This is the simplest form of the kernel, used when the data is linearly separable.

Polynomial Kernel: \(K(x_i, x_j) = (1 + x_i^T x_j)^d\). This kernel maps the input features into a polynomial feature space, allowing for polynomial decision boundaries.

Radial Basis Function (RBF) Kernel: \(K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}\). Also known as the Gaussian kernel, it maps the features into an infinite-dimensional space, providing a lot of flexibility for non-linear decision boundaries.

Each kernel function has its own set of parameters that need to be tuned for optimal performance. The choice of kernel function and its parameters can significantly impact the SVM model’s ability to capture the underlying patterns in the data.
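The three kernels above can be sketched in a few lines of code. This is a language-agnostic illustration (the report's modelling is done in R); the `gamma` and `d` values here are arbitrary choices, not tuned parameters.

```python
import math

# Sketches of the three kernel functions discussed above, written over
# plain Python lists. gamma and d are illustrative, untuned values.

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    return (1 + linear_kernel(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))       # 1*2 + 2*0 = 2.0
print(polynomial_kernel(x, z))   # (1 + 2)^2 = 9.0
print(rbf_kernel(x, z))          # exp(-0.5 * 5) ~ 0.082
```

Note that the RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart, which is what gives it its local, flexible decision boundaries.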

SVM’s Objective Function and Optimization

The objective function that SVM optimizes combines maximizing the margin with minimizing the classification error. This is achieved through the minimization of the following objective function: \[\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i\] subject to the constraints: \[y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i,\] where \(w\) is the weight vector, \(b\) is the bias term, \(C\) is the regularization parameter, \(\xi_i\) are the slack variables representing the degree of misclassification of the \(i\)-th data point, and \(y_i\) are the class labels.

Hinge Loss:

The hinge loss function is used in SVM to penalize misclassifications. It is defined as: \[\text{Hinge loss} = \max(0, 1 - y_i (w^T x_i + b))\] The hinge loss is zero for correctly classified points that are outside the margin, and it increases linearly for points that are on the wrong side of the hyperplane or within the margin.

The optimization of the objective function involves finding the values of \(w\) and \(b\) that minimize the function, subject to the constraints. This is typically done using quadratic programming techniques.
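To make the objective concrete, the sketch below evaluates \(\frac{1}{2}\|w\|^2 + C \sum_i \xi_i\) for a given \((w, b)\), using the hinge loss as the slack. It only evaluates the objective; actually minimizing it requires a quadratic programming solver, as noted above. The data points and \(C\) are made up for illustration.

```python
# Soft-margin objective sketch: 0.5*||w||^2 + C * sum of hinge losses.
# This only EVALUATES the objective for a given (w, b); solving the
# quadratic program is left to a real solver. Data and C are illustrative.

def hinge_loss(y, w, x, b):
    """max(0, 1 - y*(w.x + b)) for one labelled point, with y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def svm_objective(w, b, data, C=1.0):
    reg = 0.5 * sum(wi * wi for wi in w)                  # margin term
    loss = sum(hinge_loss(y, w, x, b) for x, y in data)   # slack term
    return reg + C * loss

data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.2, 0.0], -1)]
w, b = [1.0, 0.0], 0.0
# reg = 0.5; first two points are outside the margin (loss 0),
# the third has margin -0.2, so hinge = 1.2; total = 1.7
print(svm_objective(w, b, data))  # 1.7
```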

Objective

The main objective of this project is to utilize the Support Vector Machine (SVM) algorithm to effectively predict customer churn at ABC Multinational Bank. This predictive model aims to identify key indicators that signal the likelihood of customers opting to leave the bank. By understanding these indicators, the bank can deploy targeted interventions to improve customer satisfaction and retention. Ultimately, this effort will enable ABC Multinational Bank to take proactive measures in retaining valuable customers, thereby stabilizing their customer base and enhancing long-term business sustainability.

Analysis and Modelling

Data Description

The dataset utilized in this project was sourced from Kaggle, a platform known for providing a wide range of high-quality datasets. The attributes of the dataset are:

customer_id: A unique identifier for each customer, not used in the analysis.
credit_score: A numerical representation of the customer’s creditworthiness.
country: The country in which the customer resides.
gender: The gender of the customer (e.g., male, female).
age: The age of the customer in years.
tenure: The number of years the customer has been with the bank.
balance: The current balance in the customer’s account.
products_number: The number of products the customer has with the bank.
credit_card: Indicates whether the customer has a credit card with the bank.
active_member: Indicates whether the customer is an active member.
estimated_salary: The estimated annual salary of the customer.
churn: The target variable, indicating customer churn (1 for churned, 0 for not churned).

Exploratory Data Analysis

This dataset consists of 14 columns and 165034 rows.

[1] 165034     14
             id      CustomerId         Surname     CreditScore       Geography 
              0               0               0               0               0 
         Gender             Age          Tenure         Balance   NumOfProducts 
              0               0               0               0               0 
      HasCrCard  IsActiveMember EstimatedSalary          Exited 
              0               0               0               0 

There are no null values in the dataset.

     Geography         Gender      HasCrCard IsActiveMember         Exited 
             3              2              2              2              2 

Geography: The Geography column has 3 unique values: France, Germany, Spain.

Gender: The Gender column has 2 unique values: Male, Female.

IsActiveMember: This column has 2 unique values, yes and no, indicating whether the customer is an active member.

HasCrCard: This column has 2 unique values, yes and no, indicating whether the customer has a credit card.

Exited: This is the target column, which indicates whether the customer has exited the bank.

  CreditScore         Age            Tenure         Balance      
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0  
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0  
 Median :659.0   Median :37.00   Median : 5.00   Median :     0  
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478  
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940  
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898  
 NumOfProducts   EstimatedSalary    
 Min.   :1.000   Min.   :    11.58  
 1st Qu.:1.000   1st Qu.: 74637.57  
 Median :2.000   Median :117948.00  
 Mean   :1.554   Mean   :112574.82  
 3rd Qu.:2.000   3rd Qu.:155152.47  
 Max.   :4.000   Max.   :199992.48  

Credit Score: Ranges from 350 to 850, with a median of 659, indicating mid-range creditworthiness among the customer base.

Age: Customers’ ages range from 18 to 92 years, with a median age of 37, suggesting a predominantly middle-aged clientele.

Tenure: Tenure with the bank varies from 0 to 10 years, with a median of 5 years, showing that customers are fairly evenly distributed in terms of loyalty.

Balance: Account balances range up to $250,898, but the median balance is 0, indicating that many customers maintain low or no balances.

Number of Products: Most customers have between 1 and 2 banking products, with a median of 2 products per customer.

Estimated Salary: Salaries vary widely, up to $199,992.48, with a median of $117,948, reflecting a broad spectrum of income levels among the bank’s clientele.

Data Visualization

Observations:

The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.

There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.

The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.

There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).

Observations:

The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.

Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.

The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.

The boxplot shows no apparent outliers, as there are no data points beyond the whiskers which represent 1.5 times the interquartile range.

Observations:

France has the highest count of customers using one product, followed closely by those using two products. The number of customers using three and four products is significantly lower.

Germany shows a similar pattern to France with one and two products being the most common among customers. However, the count for one product is notably lower than in France, whereas the count for two products is slightly higher.

Spain’s pattern mirrors that of France and Germany, with one product being the most common, followed by two products. Again, three and four products are used by a considerably smaller number of customers.

Observations:

The highest churn rate is in Germany (37.9%), followed by Spain (17.22%), with the lowest in France (16.53%).

Observations:

Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.

Observations:

There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.

Observations:

We can see there is a significant imbalance in the data; this is addressed during model building.

Data preparation for modelling

One Hot Encoding: As Geography is a categorical column, we performed one-hot encoding to convert it into a numerical column for each value type.

Label Encoding: Performed label encoding on the Gender and Exited columns to convert them to numerical columns.

We split our data into training and testing sets using a 70-30 split to evaluate the performance of our models on unseen data.

To address the class imbalance in our target variable, we employed the ROSE package, which implements a combination of over-sampling and under-sampling techniques. This method effectively created a more balanced distribution of classes, ensuring better representation of both majority and minority groups in our dataset.
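The rebalancing idea can be illustrated with a much simpler sketch than what ROSE actually does (ROSE draws smoothed synthetic examples; the code below just randomly over-samples the minority class and under-samples the majority class to a common size). All names and sizes here are illustrative.

```python
import random

# Simplified rebalancing sketch: random over-sampling of the minority
# class and under-sampling of the majority class to a common target size.
# NOT the ROSE algorithm (which generates smoothed synthetic points);
# this only illustrates the combined over/under-sampling idea.

def rebalance(rows, labels, target_per_class, seed=123):
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        if len(members) >= target_per_class:
            # majority class: sample WITHOUT replacement (under-sampling)
            picked = rng.sample(members, target_per_class)
        else:
            # minority class: sample WITH replacement (over-sampling)
            picked = [rng.choice(members) for _ in range(target_per_class)]
        out_rows.extend(picked)
        out_labels.extend([y] * target_per_class)
    return out_rows, out_labels

rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2          # imbalanced: 8 retained vs 2 churned
new_rows, new_labels = rebalance(rows, labels, target_per_class=5)
print(new_labels.count(0), new_labels.count(1))  # 5 5
```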

Normalization is a crucial preprocessing step that helps in standardizing the data values within a specific range, typically zero mean and unit variance. This is particularly important for algorithms that assume data is normally distributed, or for those that are sensitive to the scale of the input features, such as Support Vector Machines or k-nearest neighbors.

In normalization we have calculated the mean and standard deviation for selected numerical columns in the training data, using the apply function in R to avoid data leakage from the test set. The training data is then standardized by subtracting the mean and dividing by the standard deviation for each column, a method known as Z-score normalization. This same scaling approach is applied to the test set to ensure both datasets are on a comparable scale, facilitating more accurate model training and evaluation.
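The leakage-free scaling described above (fit statistics on the training split only, then reuse them on the test split) can be sketched as follows; the balance values are made-up numbers, not drawn from the dataset.

```python
import math

# Z-score normalization fitted on the TRAINING split only, then applied
# to the test split with the same statistics, to avoid data leakage.
# Mirrors the R workflow described in the text; the numbers are made up.

def fit_scaler(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / (n - 1)  # sample variance
    return mean, math.sqrt(var)

def transform(column, mean, sd):
    return [(v - mean) / sd for v in column]

train_balance = [0.0, 0.0, 100000.0, 200000.0]
test_balance = [50000.0]

mean, sd = fit_scaler(train_balance)            # statistics from train only
train_scaled = transform(train_balance, mean, sd)
test_scaled = transform(test_balance, mean, sd)  # same mean/sd reused
print(sum(train_scaled))                         # centred train sums to ~0
```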

Statistical Modelling

In this modeling phase, two different machine learning models are trained for a classification task using the caret package in R. The models include Support Vector Machine (SVM) with a radial basis function kernel and the Random Forest model.

The target variable Exited is converted to a factor to ensure that it is treated as a categorical variable for classification. A consistent seed (set.seed(123)) is set before training each model to ensure reproducibility of the results.

The trainControl function is used to set up the training control parameters, specifying 5-fold cross-validation (method = “cv”) to assess the performance of the models.

Cross-validation is a statistical technique used to evaluate the performance and stability of machine learning models. In this approach, the dataset is randomly partitioned into \(K\) equal or nearly equal sized sub-datasets or “folds.” The model is then trained and tested \(K\) times, with each fold used exactly once as the validation data and the remaining \(K-1\) folds used for training; here we use \(K = 5\).
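The fold construction itself is simple to sketch: shuffle the row indices, cut them into \(K\) groups, and rotate which group serves as the validation set. This is an illustration of the mechanism, not the caret implementation.

```python
import random

# Sketch of K-fold splitting: shuffle indices, cut into K folds, and use
# each fold once for validation with the rest for training (K = 5 here,
# matching the 5-fold cross-validation described in the text).

def kfold_indices(n, k=5, seed=123):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # deal indices round-robin
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, valid))
    return splits

splits = kfold_indices(20, k=5)
for train, valid in splits:
    assert len(valid) == 4 and len(train) == 16
    assert not set(train) & set(valid)      # no overlap within a split
print(len(splits))  # 5
```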

Benefits of K-Fold Cross-Validation

Reduces Overfitting: By using different subsets of the data for training and validation, K-fold cross-validation reduces the risk of the model overfitting to a specific portion of the data.

Improves Model Generalizability: Since the model is validated multiple times against different subsets of data, it ensures that the model performs well across various samples of the data, not just on the data it was trained on. This helps in assessing the model’s ability to generalize to new, unseen data.

Efficient Use of Data: Unlike a simple train/test split, cross-validation allows every observation in the dataset to be used for both training and validation. This is especially beneficial when dealing with limited data resources, as it maximizes the amount of training data available.

Reliable Performance Estimation: Each data point gets to be in a test set exactly once and in a training set four times. This comprehensive involvement ensures that the performance metric you compute over the folds is more reliable and robust, as it incorporates a wider range of scenarios.

Minimizes Bias: The random shuffling and partitioning of data into folds help minimize bias associated with the order or any potential patterns in the data collection process. This randomization helps ensure that the validation process is as impartial as possible.

Data Preprocessing: Centering and scaling are crucial for SVM performance because it ensures that all features contribute equally to the distance calculations in the feature space.

Model Overview: SVM is a powerful classification technique that works by finding a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points. For non-linearly separable data, SVM uses a kernel trick to transform the data into a higher-dimensional space where a hyperplane can be used for separation.

Radial Basis Kernel Function: The SVM model uses a Radial Basis Function (RBF) kernel to handle non-linear separation between classes. This kernel function is defined as

\[ K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \] Here, \(x_i\) and \(x_j\) are two feature vectors in the input space, \(\gamma\) is a parameter that defines the spread of the kernel, and \(\|x_i - x_j\|^2\) is the squared Euclidean distance between the two feature vectors.

After incorporating our data columns, the SVM’s RBF kernel equation looks like the following, where the squared distance expands into a sum of squared differences over the individual (scalar) features.

\[ K(x_i, x_j) = e^{-\gamma \left( (CreditScore_i - CreditScore_j)^2 + (Age_i - Age_j)^2 + (Balance_i - Balance_j)^2 + (ProductsNumber_i - ProductsNumber_j)^2 + (EstimatedSalary_i - EstimatedSalary_j)^2 + \ldots \right)} \]

The trained models are tested on the test set, and the performances of both models are compared.

Evaluation Metrics

Accuracy: The percentage of total customers correctly predicted as churned or not churned by the model.

Sensitivity (Recall): The proportion of actual churned customers that the model correctly identifies (True Positives).

Specificity: The proportion of customers who have not churned that the model correctly predicts as not exiting (True Negatives).

Kappa: The measure of agreement between the churn predictions and actual churn instances, corrected for chance agreement.

Positive Predictive Value (Precision): The proportion of customers the model predicts as churned that have actually churned (True Positives among positive predictions).

Negative Predictive Value: The likelihood that a customer predicted to not churn by the model has indeed not churned.

Balanced Accuracy: An average of the model’s ability to correctly identify both churned and retained customers, crucial for datasets where churn events are less common.

AUC-ROC: The probability that the model ranks a randomly chosen churned customer higher than a randomly chosen customer who hasn’t churned, indicating how well the model distinguishes between the two groups.
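Most of these metrics follow directly from the four cells of a 2x2 confusion matrix, as the sketch below shows (AUC-ROC is the exception, since it needs ranked scores rather than hard predictions). The counts are made up for illustration and are not taken from this report's results.

```python
# Computing the evaluation metrics above from a 2x2 confusion matrix.
# tp/fp/fn/tn counts are illustrative, not taken from the report.

def churn_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)            # recall on churned customers
    specificity = tn / (tn + fp)            # recall on retained customers
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv,
            "balanced_accuracy": balanced_accuracy, "kappa": kappa}

m = churn_metrics(tp=40, fp=10, fn=20, tn=30)
print(m["accuracy"])            # (40 + 30) / 100 = 0.7
print(m["balanced_accuracy"])   # (0.6667 + 0.75) / 2
```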

Results

Metric                       SVM       Random Forest
Accuracy                     79.52%    83.38%
Kappa                        0.494     0.5375
Sensitivity                  79.11%    86.48%
Specificity                  81.09%    71.70%
Positive Predictive Value    94.02%    91.99%
Negative Predictive Value    50.82%    58.54%
Balanced Accuracy            80.10%    79.09%
AUC-ROC                      0.801     0.791

Conclusion

In this analysis of bank customer churn prediction, the Support Vector Machine (SVM) model has shown promising results, particularly in terms of specificity (81.09%) and positive predictive value (94.02%). These metrics are crucial in the banking context, as they indicate the model’s accuracy in correctly identifying loyal customers (specificity) and its reliability in flagging potential churners (positive predictive value). Additionally, the SVM model’s higher AUC-ROC score of 0.801 underscores its effectiveness in distinguishing between customers who are likely to churn and those who are not across various decision thresholds.

Although the Random Forest model exhibited the highest overall accuracy (83.38%) and kappa score (0.5375), its lower specificity and negative predictive value compared to the SVM model suggest it may produce more false positives, leading to misallocated retention efforts.

Thank You